Applied Data Analysis in Python
What next
Further topics
Here are some additional chapters to work through on a number of different topics. Choose the ones that interest you.
Feature scaling and principal component analysis are important parts of many data analysis pipelines.
Clustering is an unsupervised classification method.
Image Clustering uses clustering techniques to demonstrate a form of image compression.
Other concepts
This course has provided a quick overview of some of the basic data analysis and machine learning tools available. Of course, it could not cover the full breadth of possible topics, so here I will give some pointers to things you may want to learn next.
Naïve Bayes classification is a supervised classification method which works by attempting to fit a model, assuming that the data you present it with was originally generated by that model. It uses Bayesian statistics to work out which model parameters best describe the distribution of data. Read more at scikit-learn.
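As a rough illustration of the idea, here is a minimal sketch using scikit-learn's `GaussianNB`. The choice of the Gaussian variant, the iris data set and the `random_state` value are assumptions made for this example only, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# The iris data set is used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# GaussianNB assumes each feature follows a normal distribution within each class
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# theta_ holds the fitted per-class feature means, shape (n_classes, n_features)
print(nb_model.theta_.shape)

# Fraction of held-out points assigned to the correct class
nb_accuracy = nb_model.score(X_test, y_test)
print(f"Test accuracy: {nb_accuracy:.2f}")
```

The fitted means in `theta_` are the "model parameters" referred to above: one Gaussian per feature per class.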
Support vector machines are another supervised classification method which tries to find the dividing line between different classes. Read more at scikit-learn.
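The "dividing line" can be made concrete with a small sketch. It assumes a linear kernel and a synthetic two-class data set from `make_blobs`; both are choices made for this example, and other kernels give curved boundaries.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters (illustrative data only)
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

# A linear kernel looks for a straight dividing line between the classes
svm_model = SVC(kernel="linear")
svm_model.fit(X, y)

# The line is w . x + b = 0, with w in coef_ and b in intercept_
print(svm_model.coef_, svm_model.intercept_)

# Only the points closest to the line (the support vectors) determine it
print(len(svm_model.support_vectors_))

svm_accuracy = svm_model.score(X, y)
print(f"Training accuracy: {svm_accuracy:.2f}")
```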
Decision trees are a supervised classification method which creates a tree of binary choices in order to assign a class to a data point. For example, on a population data set, the first question might be "is the person's height above 1.6 m?". Depending on the answer to that, the next question asked may be different. The path through the tree depends on the exact details of the data point and so each leaf will be associated with a predicted class. Due to the large number of potential parameter combinations, decision trees require more data than many other methods but are capable of creating a more nuanced response. Read more at scikit-learn.
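To see what such a tree of questions looks like in practice, the following sketch fits a shallow tree and prints its rules. The iris data set and the `max_depth=2` limit are assumptions chosen to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# The iris data set is used purely for illustration
iris = load_iris()

# Limit the depth so that the printed tree stays readable
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(iris.data, iris.target)

# Each line is a binary question on one feature; leaves show the predicted class
tree_rules = export_text(tree_model, feature_names=list(iris.feature_names))
print(tree_rules)
```

Each indented level in the printed output is one further question along the path, and every path ends in a leaf with a single predicted class.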
Neural networks are the most widely known technique and are generally used as a classification tool in both supervised and unsupervised situations. They are very versatile and are often the first tool reached for by data scientists, even when there is a simpler method available.
K-Folds cross-validation is a more advanced technique for testing and validating your models. It greatly increases the time to fit your model but, if you can afford it, it is worth using. scikit-learn has built-in support for it.
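A minimal sketch of that built-in support, using `KFold` together with `cross_val_score`. The k-nearest-neighbours model, the iris data set and the choice of five folds are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# The iris data set is used purely for illustration
X, y = load_iris(return_X_y=True)

# Shuffle before splitting because the iris rows are ordered by class
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# The model is fitted five times, each time tested on a different held-out fold
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=folds)

print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.2f}")
```

The model is refitted once per fold, which is where the extra fitting time comes from. In return you get a spread of scores rather than a single number from one train/test split.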
Further reading
- Python Data Science Handbook by Jake VanderPlas
- scikit-learn documentation
- Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Géron
Credits
This course was written by Matt Williams. All text is published under a Creative Commons Attribution 4.0 International License with all code snippets licensed as MIT.
The source for the material can be found on GitLab where fixes are welcome.